Abstract

Displaying a document in Middle Eastern languages requires contextual analysis because each character of the alphabet has several presentational forms. The words of the document are formed by joining the correct positional glyphs that represent the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers.

In this paper, we propose a machine learning approach for contextual analysis based on the first order Hidden Markov Model. We design and build a model for the Farsi language to exhibit this technology. The Farsi model achieves 94% accuracy with training based on a short list of 89 Farsi words consisting of 2780 Farsi characters.

The experiment can be easily extended to many languages including Arabic, Urdu, and Sindhi. Furthermore, the advantage of this approach is that the same software can be used to perform contextual analysis without coding complex rules for each specific language. Of particular interest is that the languages with fewer speakers can have greater representation on the web, since they are typically ignored by software developers due to lack of financial incentives.

Key Words: Unicode, Contextual Analysis, Hidden Markov Models, Big Data, Middle Eastern Languages, Farsi, Arabic

1 Introduction 

One of the main objectives of Unicode is to provide a setting in which non-English documents can be easily created and displayed on modern electronic devices such as laptops and cellular phones. Consequently, this encoding has led to the development of many software tools for text editing, font design, storage, and management of data in foreign languages. For commercial reasons, languages with large speaking populations and large economies have enjoyed much more rapid advancement in Unicode-based technologies. On the other hand, less widely spoken languages such as Pushtu are given barely any attention. According to [?], approximately 40 to 60 million people speak Pushtu worldwide.

Many Unicode-based technologies rely on proprietary and patented methods and thus are not available to the general open source software developers’ communities. For example, BIT [9] does not reveal its contextual analysis algorithm for Farsi [10]. Many software engineers need to redevelop new methods to implement tools that mimic these commercial technologies. The new contextual analysis for Farsi developed by Moshfeghi at the Iran Telecommunication Research Center is an example of these kinds of efforts [10].

Unicode also introduces a challenge for the internationalization of any software, regardless of whether it is commercial or open source. Tim Bray [2] writes:

“Whether you’re doing business or academic research or public service, you have to deal with people, and these days, it’s quite likely that some of the people you want to deal with come from somewhere else, and you’ll sometimes want to deal with them in their own language. And if your software is unable to collect, store, and display a name, an address, or a part description in Chinese, Bengali, or Greek, there’s a good chance that this could become very painful very quickly.

There are a few organizations that as a matter of principle operate in one language only (The US Department of Defense, the Académie française) but as a proportion of the world, they shrink every year.”

This internationalization is a costly effort and subject to the availability of resources. As mentioned above, languages with large speaking populations such as Mandarin attract much of the effort. The availability of data in Unicode represents an opportunity to employ machine learning techniques to advance software internationalization and foreign text manipulation. Language translation technologies rely heavily on Hidden Markov Models (HMMs) to improve translation accuracy [?]. In this paper, we propose the use of HMMs for contextual analysis. In particular, we design and build a generic HMM for Farsi that can be easily adapted to other Middle Eastern languages.

In Section 2, we provide some background and related work on contextual analysis. Section 3 gives a brief introduction to the first order HMM. In Section 4, we describe the design and implementation of our HMM for Farsi contextual analysis. The training and testing of the HMM are explained in Section 5. Finally, Section 6 presents our conclusion and proposes future work.

2 Background 

In 2002, the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst, held a workshop on Challenges in Information Retrieval and Language Model [7]. The premise of this workshop was to promote the use of Language Model technology for various natural languages. The aim is to use the same software for indexing and retrieval regardless of the language. It was pointed out that, by using training materials such as document collections, we can automatically build retrieval engines for all languages. This report was one of the reasons that we decided to start a couple of projects on Farsi and Arabic [16], [?].

Consequently, these projects led to the development of two widely used Farsi and Arabic stemmers [?]. One of the difficulties we had was the lack of technologies for input and display of Farsi and Arabic documents [19]. For example, we needed an input/display method that would allow us to enter Farsi query words in a Latin-based operating system without any special software or hardware. It was further necessary to have a standard character encoding for text representation and searching. At the time, we developed a system that provides the following capabilities:

• a web-browser based keyboard applet for input 

• if the web-browser has the ability to process and 
display Unicode content, it will be used 


• if the browser cannot display Unicode content, an 
auxiliary process will be invoked to render the 
Unicode content into a portable bitmap image 
with associated HTML to display the image in the 
browser. 


Another area of difficulty that we encountered is that the presence of white space used to separate words in the document depends on the display geometry of the glyphs. Since Farsi and Arabic are written using a cursive form, each character can have up to four different display glyphs. These glyphs represent the four different presentation forms:


isolated: the standalone character 
initial: the character at the beginning of a word 
medial: the character in the middle of a word 
final: the character at the end of a word 


We found that depending on the amount of trailing white space following a final form glyph, a space character may or may not be found in the text. This situation came to light when our subject matter experts were developing our test queries. We found that since the glyphs used to display the final form of characters had very little trailing space, they were manually adding space characters to improve the look of the displayed queries.

2.1 Keyboard Applet 

The keyboard applet was written in JavaScript. The applet displays a Farsi keyboard image with the ability to enter characters from both the keyboard and mouse. The applet also handles character display conversion and joining of the input data.

The keyboard layout is based on the ISIRI 2901:1994 standard layout as documented in an email by Pournader [12]. Figure 1 shows the keyboard applet being used to define our test queries for search and retrieval.

Display of the input data is normally performed by using the preloaded glyph images. However, if a character has not been preloaded, it can be generated on the fly. Most of the time, these generated characters are “compound” characters. Farsi (and other Arabic script languages) may use “compound” characters, which are a combination of two or more separate characters. For example, the rightmost character of خُرما, the Farsi word for “date” (that is, the fruit), is a combination of a kh with a damma.

Figure 1. Example use of the keyboard applet

The complications associated with our work on Farsi 
and Arabic convinced us that we need to develop 
generic machine learning tools if we want to develop 
display and search technologies for most of the Middle 
Eastern languages. In the next few sections, we will 
offer a solution to contextual analysis to display the 
correct presentational forms of characters. 

3 Hidden Markov Model 

An HMM is a finite state automaton with probabilistic transitions and symbol emissions [?]. An HMM consists of:

• A set of states $S = \{s_1, \ldots, s_n\}$.

• An emission vocabulary $V = \{w_1, \ldots, w_m\}$.

• Probability distributions over emission symbols, where the probability that a state $s$ emits symbol $w$ is given by $P(w \mid s)$. This is denoted by matrix $B$.

• Probability distributions over the set of possible outgoing transitions. The probability of moving from state $s_i$ to $s_j$ is given by $P(s_j \mid s_i)$. This is denoted by matrix $A$.

• A subset of the states that are considered start states, and to each of these is associated an “initial” probability that the state will be a start state. This is denoted by $\pi$.


As an example, consider the widely used HMM [?] that decodes weather states based on a friend’s activities. Assume there are only two states of weather: Sunny and Rainy. Also assume there are only three activities: Walking, Shopping, and Cleaning.

You regularly call your friend who lives in another city to find out about his activity and the weather status. He may respond by saying “I am cleaning and it is rainy”, or “I am shopping and it is sunny”. If you collect a good number of these weather states and activities, you can then summarize your data as the HMM shown in Figure 2.

This HMM states that on rainy days, your friend walks 10% of the days, while on sunny days, he walks 60% of the days. The statistics associated with this HMM are obtained by simply counting the activities on rainy and sunny days.

You will also notice arrows from states to states that keep track of weather changes. For example, our HMM reflects the fact that on a rainy day, there is a 70% chance of rain the next day and a 30% chance of sunshine.
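The weather HMM can be written down directly as data. The following is a minimal Ruby sketch of such a representation. Only the walk probabilities (0.1 and 0.6) and the rainy-day transitions (0.7 and 0.3) come from the text; the remaining numbers are assumed values borrowed from the commonly circulated version of this example, and the method names are illustrative.

```ruby
# Weather HMM as plain Ruby data. Values marked "assumed" are not in the text.
def weather_hmm
  {
    states: %i[rainy sunny],
    vocabulary: %i[walk shop clean],
    # Initial probabilities (pi): assumed.
    initial: { rainy: 0.6, sunny: 0.4 },
    # Transition matrix A, indexed as [from][to].
    transition: {
      rainy: { rainy: 0.7, sunny: 0.3 }, # from the text
      sunny: { rainy: 0.4, sunny: 0.6 }  # assumed
    },
    # Emission matrix B, indexed as [state][symbol].
    emission: {
      rainy: { walk: 0.1, shop: 0.4, clean: 0.5 }, # walk from the text, rest assumed
      sunny: { walk: 0.6, shop: 0.3, clean: 0.1 }  # walk from the text, rest assumed
    }
  }
end

# Every probability distribution in an HMM must sum to one.
def distribution?(probs, tolerance = 1e-9)
  (probs.values.sum - 1.0).abs < tolerance
end

# Check the initial distribution and every row of A and B.
def valid_hmm?(hmm)
  rows_ok = hmm[:states].all? do |s|
    distribution?(hmm[:transition][s]) && distribution?(hmm[:emission][s])
  end
  distribution?(hmm[:initial]) && rows_ok
end

puts "distributions sum to one: #{valid_hmm?(weather_hmm)}"
```

Keeping the model as nested hashes makes each row of $A$ and $B$ easy to inspect and to check against the figure.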

In addition, one can keep track of how many days in the data are sunny or rainy. These counts give the initial probabilities. Formally, these statistics are calculated by Maximum Likelihood Estimates (MLE). Transition probabilities are estimated as:

$$P(s_j \mid s_i) = \frac{\text{Number of transitions from } s_i \text{ to } s_j}{\text{Total number of transitions out of } s_i} \qquad (1)$$


The emission probabilities are estimated with Maximum Likelihood supplemented by smoothing. Smoothing is required because Maximum Likelihood estimation will sometimes assign a zero probability to unseen emission-state combinations.

Prior to smoothing, emission probabilities are estimated by:

$$P(w \mid s)_{ml} = \frac{\text{Number of times } w \text{ is emitted at } s}{\text{Total number of symbols emitted by } s} \qquad (2)$$
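Both estimates amount to counting. The sketch below estimates transition and emission probabilities from labeled data in Ruby. The method names and the toy data are hypothetical, and add-one (Laplace) smoothing is an assumed choice, since the text does not say which smoothing method was used.

```ruby
# Transition estimate by relative frequency, as in equation (1).
# state_sequences: array of state arrays, e.g. [[:rainy, :rainy, :sunny]].
def estimate_transitions(state_sequences)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  state_sequences.each do |seq|
    seq.each_cons(2) { |from, to| counts[from][to] += 1 }
  end
  counts.to_h do |from, row|
    total = row.values.sum.to_f
    [from, row.transform_values { |c| c / total }]
  end
end

# Emission estimate as in equation (2), with add-k smoothing.
# k = 1 is Laplace smoothing (an assumed choice); k = 0 is the plain estimate.
# pairs: array of [state, symbol] observations.
def estimate_emissions(pairs, vocabulary, k = 1)
  counts = Hash.new { |h, key| h[key] = Hash.new(0) }
  pairs.each { |state, symbol| counts[state][symbol] += 1 }
  counts.to_h do |state, row|
    total = row.values.sum + k * vocabulary.size
    [state, vocabulary.to_h { |w| [w, (row[w] + k) / total.to_f] }]
  end
end

# Toy data: five days of weather, and four (weather, activity) observations.
days = [%i[rainy rainy sunny sunny rainy]]
pairs = [[:rainy, :clean], [:rainy, :clean], [:rainy, :shop], [:sunny, :walk]]
vocabulary = %i[walk shop clean]

p estimate_transitions(days)
p estimate_emissions(pairs, vocabulary, 0) # unsmoothed: rainy/walk is zero
p estimate_emissions(pairs, vocabulary)    # smoothed: no zero entries
```

With `k = 0` the unseen pair (rainy, walk) receives probability zero, which is the situation smoothing is meant to avoid.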

Figure 2. An HMM for Activities and weather


The most interesting part of an HMM is the decoding aspect. We may be told that our friend’s activities for the last four days were cleaning, cleaning, shopping, cleaning, and we want to know what the weather patterns were for those four days. This essentially translates to finding a sequence of four states $s_1 s_2 s_3 s_4$ that maximizes $P(s_1 s_2 s_3 s_4 \mid \text{cleaning cleaning shopping cleaning})$. This amounts to choosing the highest probability among $2^4 = 16$ choices for $s_1 s_2 s_3 s_4$. Such an exhaustive search becomes computationally very expensive as the number of states and symbols increases. The solution is given by the Viterbi algorithm, which finds an optimal path using dynamic programming [14]. Algorithm 1 is a modification of the pseudocode from [?].
In the next section, we describe the design and implementation of an HMM for Farsi contextual analysis.


4 Farsi Hidden Markov Model 

ALGORITHM 1: Viterbi Algorithm

Data: Given K states and M vocabulary symbols, and a sequence of vocabulary symbols Y = w_1 w_2 ... w_{n-1} w_n

Result: The most likely state sequence R = r_1 r_2 ... r_{n-1} r_n that maximizes the above probability

Function Viterbi(V, S, π, Y, A, B) : R
    for each state s_i do
        T_1[i, 1] = π_i * B_{i w_1};
        T_2[i, 1] = 0;
    end
    for i = 2, 3, ..., n do
        for each state s_j do
            T_1[j, i] = max_k (T_1[k, i-1] * A_{kj} * B_{j w_i});
            T_2[j, i] = argmax_k (T_1[k, i-1] * A_{kj} * B_{j w_i});
        end
    end
    z_n = argmax_k (T_1[k, n]);
    r_n = s_{z_n};
    for i = n, n-1, ..., 2 do
        z_{i-1} = T_2[z_i, i];
        r_{i-1} = s_{z_{i-1}};
    end
    Return R

The Farsi HMM is very similar to the example HMM described in the previous section. The HMM has a state for each presentation form of each letter of the Farsi alphabet, and a vocabulary of size 32, one symbol for each character in the Farsi alphabet. A simple calculation (32 characters with 4 forms each) suggests that the Farsi HMM should have 128 states and a vocabulary of 32. In fact, the HMM has fewer than 128 states, since some of the characters do not have four presentational forms. For example, there are only two states for the character ا, as there is no medial or initial form for this character.
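The state count can be checked with a few lines of Ruby. The sketch below assumes the standard 32-letter Farsi alphabet and the seven letters that never join to a following letter, which therefore have only isolated and final forms. The method names are illustrative. Under these assumptions the upper bound drops from 128 to 114 states, and a model trained on limited data may contain fewer.

```ruby
# The 32 letters of the Farsi alphabet (assumed standard inventory).
def farsi_alphabet
  %w[ا ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی]
end

# Letters that do not join to the following letter: isolated and final forms only.
def non_joining_letters
  %w[ا د ذ ر ز ژ و]
end

# Presentation forms available for a given letter.
def forms_for(letter)
  if non_joining_letters.include?(letter)
    %i[isolated final]
  else
    %i[isolated initial medial final]
  end
end

# One HMM state per (letter, presentation form) pair.
def farsi_states
  farsi_alphabet.flat_map { |letter| forms_for(letter).map { |form| [letter, form] } }
end

puts "upper bound: #{farsi_alphabet.size * 4}" # 32 letters x 4 forms
puts "states under the joining rule: #{farsi_states.size}"
```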


Figure 3. The four isolated characters on the left are vocabularies while the four characters on the right are states of the HMM


As an example, suppose we want to type the word شغال, in English jackal. On the keyboard, we type four isolated characters ش, غ, ا, and ل. The HMM should decode these four characters as initial, medial, final, and isolated, respectively. In other words, the sequence of the four isolated characters (or vocabulary in HMM terminology) should be decoded into the four states as shown in Figure 3.
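For comparison, the target labels for this example can also be produced by the kind of hand-coded rule that the HMM is meant to replace. The sketch below is such a rule and is not the method of this paper: a letter joins to its predecessor unless that predecessor is one of the seven non-joining letters. The method names are hypothetical, and the rule ignores diacritics, spaces, and exceptional words.

```ruby
# Letters that never join to the following letter.
def non_joining?(letter)
  %w[ا د ذ ر ز ژ و].include?(letter)
end

# Assign a presentation form to each letter of a single word
# (no spaces or diacritics).
def presentation_forms(word)
  letters = word.chars
  letters.each_with_index.map do |letter, i|
    joins_prev = i > 0 && !non_joining?(letters[i - 1])
    joins_next = i < letters.size - 1 && !non_joining?(letter)
    if joins_prev && joins_next
      :medial
    elsif joins_prev
      :final
    elsif joins_next
      :initial
    else
      :isolated
    end
  end
end

p presentation_forms("شغال") # => [:initial, :medial, :final, :isolated]
```

The last letter comes out as isolated because the preceding ا does not join to the letter that follows it, which matches the decoding described above.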

The part of the HMM displayed in Figure 4 shows how the Viterbi algorithm takes the path that decodes the correct form of the characters by choosing the appropriate states. As we observe, there are four states for the character ^, representing the four shapes of this character. We also observe that there are only two states for the character ا, as there is no medial or initial form for this character.

A typical implementation of an HMM adds states and vocabulary as it is trained [20]. The training is done by providing pairs of the form $([w_1 w_2 \ldots w_{n-1} w_n], [s_1 s_2 \ldots s_{n-1} s_n])$, similar to the [vocabularies, states] sequences shown in Figure 4.


5 Training and Testing of HMM 


We trained the HMM with 89 words (2780 characters) chosen from the list of frequent words from the Kayhan newspaper published in 2005 [3]. There are over 10,000 words in this collection. We limited the training to this short list to save time. The list of these words is shown in Figure 5.

The test data is a small number of words selected randomly from a small dictionary and shown in Figure 5. This list contains 32 words (350 characters). The training file contains pairs of words separated by a vertical bar. The first word is the isolated form and the second word is the correct presentational form of the word. We read the file one line at a time and submit the two words for training as seen in the following Ruby code:


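# farsi is the HMM object created earlier (construction not shown)
f = File.open("./training-data")
# register a blank symbol and a blank state
farsi.train([" "], [" "])
f.each do |line|
  # each line has the form: isolated characters | presentational forms
  seq1, seq2 = line.chomp.split(/\s*\|\s*/)
  farsi.train(seq1.split(" "), seq2.split(" "))
end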


As can be seen, we have added a blank vocabulary symbol and a blank state to our HMM. The HMM adds vocabulary and states as a part of the training. After training, the HMM has a vocabulary of 32 and 74 states. It is anticipated that the HMM will have more states as the size of the training data increases.
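The way states and vocabulary accumulate during training can be illustrated with a small stand-in for the HMM object. The methods below are hypothetical and are not the library used in this paper; they only record counts. The sketch also assumes that the presentational forms in the training file are stored as Unicode Arabic Presentation Forms-B code points, which the text does not specify.

```ruby
require "set"

# A hypothetical stand-in for the HMM object: it only records counts.
def new_model
  { vocabulary: Set.new, states: Set.new,
    emissions: Hash.new(0), transitions: Hash.new(0) }
end

# Add one training pair; unseen symbols and states are added on the fly.
def train(model, observations, states)
  raise ArgumentError, "length mismatch" unless observations.size == states.size

  model[:vocabulary].merge(observations)
  model[:states].merge(states)
  observations.zip(states).each { |w, s| model[:emissions][[s, w]] += 1 }
  states.each_cons(2) { |from, to| model[:transitions][[from, to]] += 1 }
  model
end

# Parse one line of the training file: "isolated forms | presentational forms".
def parse_line(line)
  line.chomp.split(/\s*\|\s*/).map { |side| side.split(" ") }
end

model = new_model
[
  # sheen ghain alef lam | initial sheen, medial ghain, final alef, isolated lam
  "\u0634 \u063A \u0627 \u0644 | \uFEB7 \uFED0 \uFE8E \uFEDD",
  # beh alef | initial beh, final alef
  "\u0628 \u0627 | \uFE91 \uFE8E"
].each do |line|
  observations, states = parse_line(line)
  train(model, observations, states)
  puts "vocabulary: #{model[:vocabulary].size}, states: #{model[:states].size}"
end
```

The second word adds only one new symbol and one new state, since the final form of alef was already seen in the first word. This is the mechanism by which the number of states grows with the training data.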

The test correctly decoded 94% of all the characters. Most of the mistakes are due to the fact that the HMM has not seen enough examples of character combinations. For example, in the word jiJI, the initial form of o was not decoded correctly. A closer examination of the training data reveals that there is no occurrence of this form in the set. Similarly, there are other errors of this kind, such as the initial form of ^ in the word sj \j. There are also a few errors attributed to the double combination of a character. We believe most of these errors will be corrected with a larger training set.
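Character accuracy here is the fraction of characters whose decoded form matches the reference form. A minimal Ruby sketch of that computation follows. The method name and the toy data are illustrative and are not the test set of this paper.

```ruby
# Character-level accuracy over a list of [decoded, reference] state sequences.
def character_accuracy(pairs)
  correct = 0
  total = 0
  pairs.each do |decoded, reference|
    raise ArgumentError, "length mismatch" unless decoded.size == reference.size

    correct += decoded.zip(reference).count { |d, r| d == r }
    total += reference.size
  end
  total.zero? ? 0.0 : correct.to_f / total
end

results = [
  [%i[initial medial final isolated], %i[initial medial final isolated]],
  [%i[isolated final], %i[initial final]] # one wrong form
]
puts format("%.1f%%", 100 * character_accuracy(results)) # 5 of 6 forms correct
```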


6 Conclusion and Future Work 


In this paper, we have presented a machine learning approach to the contextual analysis of script languages. It is shown that an ergodic HMM can be easily trained to automatically decode the presentational forms of script languages.

Although the approach in this paper is developed for Farsi, it can be easily extended to other Middle Eastern languages. Further training and research in this area can improve the character accuracy. A successful program for contextual analysis may have to include a list of exceptional words that do not fall into the normal combination of the characters. It is also important to notice that most of the Arabic and Farsi typesetting technologies such as ArabTeX [8] or FarsiTeX [5] have problems with contextual analysis. This is mainly due to the fact that it is practically impossible to devise an algorithm that has 100% accuracy for tasks associated with natural languages.

Finally, a higher order HMM may also improve the contextual analysis. For example, it has been shown that a second order HMM improves handwritten character recognition [6]. It is also worth mentioning that a second order HMM does not improve error detection and correction for post processing of printed documents [18].

Figure 4. Glyphs Chosen by HMM

Figure 5. Top 89 frequent words from Kayhan newspaper published in 2005